TD - Chapter details v2

Chapter details

Greek Tokenizer and Sentence Splitter

The tokenizer is one of the core preprocessing modules of the Greek language processing chain (LPC). It can run in either training or runtime mode. In both cases the tool first tokenizes the input text or file, treating every non-alphanumeric character as a separate token. The tool also recognizes Greek and Latin characters, which are separated during the preprocessing phase. Finally, the tokenizer uses an SVM-based sentence splitter that examines punctuation symbols in order to mark the end of sentences: its built-in classifier decides whether a punctuation symbol belongs to an abbreviation or ends a sentence. Overall, the tokenizer recognizes tokens, abbreviations, non-Greek (Latin) tokens and sentence boundaries, and forms the first stage of the Greek LPC. The tool is part of the open source tools provided by the AUEB Natural Language Processing Group (http://nlp.cs.aueb.gr/software.html), specifically of the Greek Part-of-Speech tagger (version 1).
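
The tokenization rule described above can be illustrated with a minimal Python sketch. It is not the AUEB implementation: the regular expression, the Unicode ranges chosen for Greek, and the names TOKEN_RE and tokenize are assumptions made for illustration, and the SVM-based sentence splitter is not covered.

```python
import re
import unicodedata

# Hypothetical token pattern (an assumption, not the tool's actual rules):
# a run of Greek characters, a run of Latin letters, a run of digits,
# or any other single non-whitespace character as its own token.
TOKEN_RE = re.compile(
    r"(?P<greek>[\u0370-\u03FF\u1F00-\u1FFF]+)"
    r"|(?P<latin>[A-Za-z]+)"
    r"|(?P<number>[0-9]+)"
    r"|(?P<symbol>\S)"
)


def tokenize(text):
    """Return a list of (token, kind) pairs for the given text."""
    # Compose accented letters so that they fall inside the Greek ranges.
    text = unicodedata.normalize("NFC", text)
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for token, kind in tokenize("Το NLP, π.χ. 42%"):
        print(kind, token)
```

Because the alternatives are tried in order, a string that mixes scripts is split at every script change, and each punctuation mark becomes its own token that a later sentence splitter can classify.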

Greek POS Tagger

The Greek part-of-speech (POS) tagger uses a k-nearest neighbour classifier to automatically determine the part of speech (e.g. noun, adjective, verb) of each word occurrence in Greek texts. It can also tag each word with additional grammatical information, such as the gender, number and case of each noun, or the voice, number and tense of each verb. The tool is open source and, as a standalone application, includes a GUI and active/passive learning functionality. Although a pre-trained set of resources exists, the POS tagger can be retrained and configured to use alternative settings depending on the incoming text content. The original tool (version 1) has been extended and optimized for performance and reliability, bugs have been removed, and multi-core support has been added in sections where this was feasible. Within ATLAS a UIMA wrapper has been developed and integrated into the Greek Language Processing Chain as a primitive engine.
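
To make the k-nearest neighbour idea concrete, the following Python sketch tags a word by majority vote among its closest training examples. The suffix-based features, the overlap distance, the coarse tag names and the tiny training list are all assumptions chosen for illustration; the feature set and resources of the actual tagger are not described here.

```python
from collections import Counter

# Toy training data (made up for this sketch): (word, coarse tag) pairs.
TRAIN = [
    ("γράφω", "verb"),
    ("τρέχω", "verb"),
    ("παίζω", "verb"),
    ("σπίτι", "noun"),
    ("παιδί", "noun"),
    ("καλός", "adjective"),
    ("μικρός", "adjective"),
]


def features(word):
    """Hypothetical features: the last 1, 2 and 3 characters plus capitalization."""
    lower = word.lower()
    return (lower[-1:], lower[-2:], lower[-3:], word[:1].isupper())


def overlap_distance(a, b):
    """Number of feature positions in which two feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_tag(word, training, k=3):
    """Return the majority tag among the k training words closest to `word`."""
    target = features(word)
    nearest = sorted(
        training, key=lambda ex: overlap_distance(target, features(ex[0]))
    )[:k]
    votes = Counter(tag for _, tag in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_tag("διαβάζω", TRAIN))
```

Retraining in this setting simply means replacing the stored examples, which is one reason instance-based classifiers suit a tool that is meant to be adapted to different text content.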

Greek Lemmatizer

The Greek lemmatizer has been developed as a separate primitive engine of the Greek aggregate LPC and is executed after tokenization, sentence extraction and POS tagging have been performed on the input text. The module is a morphological lemmatizer for the Greek language: for a given word/token, it produces the corresponding lemma, taking into account the grammatical information assigned to the token.
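
The role of the grammatical information can be illustrated with a small Python sketch in which the lemma is looked up by token and POS tag together. The lexicon entries, the tag names and the function lemmatize are hypothetical; the actual module is a morphological lemmatizer whose internals are not described here.

```python
# Hypothetical lexicon keyed by (surface form, POS tag). The same surface
# form can map to different lemmas depending on the tag it was assigned.
LEXICON = {
    ("παιδιά", "noun"): "παιδί",
    ("γράφει", "verb"): "γράφω",
    ("μόνο", "adjective"): "μόνος",
    ("μόνο", "adverb"): "μόνο",
}


def lemmatize(token, pos):
    """Return the lemma for a tagged token, or the token itself if unknown."""
    return LEXICON.get((token, pos), token)


if __name__ == "__main__":
    for token, pos in [("μόνο", "adjective"), ("μόnο".replace("n", "ν"), "adverb")]:
        print(token, pos, lemmatize(token, pos))
```

This is why the lemmatizer runs after POS tagging in the chain: without the tag, an ambiguous form could not be resolved to a single lemma.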

Greek Noun Phrase Extractor

The noun phrase extractor is based on the Spejd tool (Java version 0.84), an open source shallow parsing and disambiguation engine (http://zil.ipipan.waw.pl/Spejd), which was adapted to support Greek grammar and a corresponding set of rules for identifying noun phrases in a text. The tool has been extended for the Greek language within the ATLAS project and can be further enhanced to produce additional shallow parsing annotations for an input text (e.g. verb phrases).
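
As a rough illustration of rule-based noun phrase identification over POS-tagged tokens, the Python sketch below applies a single rule: an optional article, any number of adjectives, then one or more nouns. It does not use Spejd or its rule syntax; the rule, the tag names and the function noun_phrases are assumptions made for this example.

```python
import re

# Each POS tag is mapped to one character so that a rule over tag
# sequences can be written as an ordinary regular expression.
TAG_CODES = {"article": "D", "adjective": "A", "noun": "N"}

# Hypothetical rule: optional article, any adjectives, one or more nouns.
NP_RULE = re.compile(r"D?A*N+")


def noun_phrases(tagged):
    """Return the noun phrases in a list of (token, tag) pairs as token lists."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    # One character per token, so match offsets are also token offsets.
    return [
        [token for token, _ in tagged[m.start():m.end()]]
        for m in NP_RULE.finditer(codes)
    ]


if __name__ == "__main__":
    sentence = [
        ("ο", "article"), ("μικρός", "adjective"), ("σκύλος", "noun"),
        ("τρέχει", "verb"), ("στον", "preposition"), ("κήπο", "noun"),
    ]
    for phrase in noun_phrases(sentence):
        print(" ".join(phrase))
```

Further annotation types, such as verb phrases, would be added in the same way: one more rule over the tag sequence.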

Greek Named Entity Extractor

The Greek Named Entity Extractor is partially based on the open source library “Named-entity recognizer for Greek texts” (version 2) developed by the AUEB group (http://nlp.cs.aueb.gr/software.html). The original tool extracts temporal expressions, person names and organization names, using semi-automatically produced regular expression patterns for the temporal expressions and an ensemble of Support Vector Machines (SVMs) for person and organization names. The extended version developed within ATLAS additionally identifies location names, using the same algorithms as for person and organization names, and extracts monetary amounts and percentages from the input text using regular expression patterns. The named entity extractor internally uses a sentence splitter (if needed), the same one described above (Greek Tokenizer and Sentence Splitter). The named-entity recognizer software is released under the GNU GPL and requires LIBSVM, which is available from http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
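
The pattern-based part of the extractor can be sketched in Python as follows. The two regular expressions, the labels MONEY and PERCENT and the function extract_entities are illustrative assumptions, not the patterns shipped with the tool, and the SVM-based recognition of names is not covered.

```python
import re

# A number with optional thousands or decimal separators, e.g. 1.200 or 3,5.
NUMBER = r"\d+(?:[.,]\d+)*"

# Hypothetical patterns for the two entity types handled by regular expressions.
PATTERNS = {
    "PERCENT": re.compile(NUMBER + r"\s?%"),
    "MONEY": re.compile(NUMBER + r"\s?(?:€|ευρώ|\$)"),
}


def extract_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), label, m.group()))
    return [(label, value) for _, label, value in sorted(hits)]


if __name__ == "__main__":
    for label, value in extract_entities("Αύξηση 3,5% και κόστος 1.200 €"):
        print(label, value)
```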
